Load the dataset
The dataset contains information about houses for sale in the Dublin area. Each entry describes one house and is mixed-type: it combines numerical, categorical and textual data.
The goal is to combine numerical/categorical features and textual features to predict the house price.
The house price is determined by factors such as:
- location (area),
- surface (size),
- the number of bedrooms,
- the number of bathrooms,
- property type,
- house-features (size of the windows, construction material).
The physical attributes of the house (the number of bedrooms, the number of bathrooms, the surface, the property type and the location) are directly available in the dataset. The house-features, by contrast, can only be inferred, sometimes indirectly, from the house-description, house-facility and house-features. Some typical entries of the dataset are shown below:
data <- read.csv(file = 'train.csv', sep = ",")
# Rows 10 to 28, columns 3 to 17; there are 17 columns and the first two are just ids
data[10:28, 3:17]
Data Cleaning, Covariate selection and preprocessing
We select the columns 'bathrooms', 'beds' and 'surface' as predictors for 'price'.
datasel <- data[c('bathrooms', 'beds', 'surface', 'price')]
datasel <- na.omit(datasel)  # remove every row that contains a missing value (NA)
datasel
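Before dropping rows, it can help to see how many of them are affected. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from `train.csv`.

```r
# Hypothetical toy data standing in for the selected columns of train.csv
toy <- data.frame(
  bathrooms = c(1, 2, NA, 3),
  beds      = c(2, 3, 4, NA),
  surface   = c(55, 80, 120, 150),
  price     = c(250000, 380000, 520000, 700000)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Rows with no missing value at all; these are the rows na.omit() keeps
n_complete <- sum(complete.cases(toy))

toy_clean <- na.omit(toy)
print(na_per_column)
print(c(before = nrow(toy), after = nrow(toy_clean)))
```

Running the same two checks on `datasel` before calling `na.omit` shows how much of the training data the model actually uses.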
Linear regression
We now fit a linear regression of price on the three selected predictors.
model = lm(price ~ bathrooms + beds + surface, data = datasel)
summary(model)
Call:
lm(formula = price ~ bathrooms + beds + surface, data = datasel)
Residuals:
     Min       1Q   Median       3Q      Max
-3228439  -195188   -56583    79729  7778095
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.288e+05  2.437e+04  -5.283 1.39e-07 ***
bathrooms    1.361e+05  1.178e+04  11.557  < 2e-16 ***
beds         1.392e+05  1.030e+04  13.515  < 2e-16 ***
surface      7.898e+00  2.323e+00   3.400 0.000684 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 501600 on 2398 degrees of freedom
Multiple R-squared: 0.3044, Adjusted R-squared: 0.3035
F-statistic: 349.8 on 3 and 2398 DF, p-value: < 2.2e-16
Is this a good model? Can we use other columns in data to improve the model? Can we include polynomial and/or interaction terms to improve the model? Use the model selection approaches you learned in session 5 to find a better model.
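One way to approach these questions is to fit a richer candidate model and compare it with the baseline. The sketch below uses simulated data (the data frame `sim` and the coefficients that generate it are invented for illustration), and it uses adjusted R-squared, AIC and a nested F-test as common comparison tools. Which approaches were covered in session 5 is not stated here, so treat this choice as an assumption.

```r
set.seed(1)
n <- 300

# Simulated predictors; not the Dublin data
sim <- data.frame(
  bathrooms = sample(1:4, n, replace = TRUE),
  beds      = sample(1:6, n, replace = TRUE),
  surface   = runif(n, 40, 300)
)

# Simulated price with a curved surface effect and a beds-by-bathrooms interaction
sim$price <- 50000 + 20000 * sim$beds * sim$bathrooms + 10 * sim$surface^2 +
  rnorm(n, sd = 50000)

# Baseline: main effects only, as in the notebook
base_model <- lm(price ~ bathrooms + beds + surface, data = sim)

# Candidate: interaction between bathrooms and beds, quadratic term in surface
rich_model <- lm(price ~ bathrooms * beds + poly(surface, 2), data = sim)

comparison <- data.frame(
  model  = c("base", "rich"),
  adj_r2 = c(summary(base_model)$adj.r.squared, summary(rich_model)$adj.r.squared),
  aic    = c(AIC(base_model), AIC(rich_model))
)
print(comparison)

# F-test of the baseline against the richer model
print(anova(base_model, rich_model))
```

A higher adjusted R-squared and a lower AIC favour the richer model. On the real data the same pattern applies with `datasel` in place of `sim`, and `step()` can automate an AIC-based search over candidate terms.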
Unseen data
You can test the predictive performance of your best model on unseen data.
datatest <- read.csv(file = 'test.csv', sep = ",")
# Rows 10 to 28, columns 3 to 16; there are 16 columns and the first two are just ids.
# The price column is not included: you have to predict the price for every entry.
datatest[10:28, 3:16]
Prediction
predictions <- predict(model, newdata = datatest)
predictions[1:5]
1 2 3 4 5
701423.2 561985.5 837755.7 834321.9 425684.7
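Note that `predict()` returns `NA` for any row in which a predictor is missing, and that `na.omit` was applied only to the training data. Whether `test.csv` contains missing values is not stated here. The sketch below assumes it might, and fills each missing predictor with the median of that column in the training data. This is one simple choice, not the notebook's method, and all data in it are made up.

```r
# Hypothetical training and test data; the values are made up for illustration
train_toy <- data.frame(
  bathrooms = c(1, 2, 2, 3, 1, 4),
  beds      = c(2, 3, 4, 4, 1, 5),
  surface   = c(50, 85, 110, 140, 45, 210),
  price     = c(240000, 390000, 450000, 610000, 200000, 900000)
)
test_toy <- data.frame(
  bathrooms = c(2, NA),
  beds      = c(3, 4),
  surface   = c(90, 120)
)

toy_model <- lm(price ~ bathrooms + beds + surface, data = train_toy)

# The second row has a missing predictor, so its prediction is NA
raw_pred <- predict(toy_model, newdata = test_toy)

# Fill each missing predictor with the median of that column in the training data
filled <- test_toy
for (col in names(filled)) {
  missing <- is.na(filled[[col]])
  filled[[col]][missing] <- median(train_toy[[col]], na.rm = TRUE)
}
filled_pred <- predict(toy_model, newdata = filled)

print(raw_pred)
print(filled_pred)
```

Checking `sum(is.na(predictions))` on the real predictions shows whether any entry needs this kind of treatment before submission.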
These are the predicted prices for the first 5 houses in the test set. You can save your best predictions and submit them to our internal data science competition with the following code:
write.csv(predictions,"name_surname.csv")
We will use the mean absolute percentage error (MAPE) to evaluate the accuracy of your predictions.
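For reference, MAPE is usually defined as the mean of |actual - predicted| / |actual|, expressed as a percentage. The exact convention used by the competition (percentage or fraction) is not stated here, so the helper `mape` below is a sketch under the usual definition, and the numbers are hypothetical.

```r
# Mean absolute percentage error, in percent
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  100 * mean(abs((actual - predicted) / actual))
}

# Hypothetical prices and predictions, for illustration only
actual    <- c(200000, 400000, 500000)
predicted <- c(220000, 360000, 500000)

mape_value <- mape(actual, predicted)
print(mape_value)
```

Because the test prices are not provided, you can estimate your own MAPE by holding out part of the training data, fitting the model on the rest, and applying `mape` to the held-out prices.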